Support Vector Regression

Roadmap

1. Embedding Numerous Features: Kernel Models

Lecture 5: Kernel Logistic Regression
two-level learning for SVM-like sparse model for soft classification, or using representer theorem with regularized logistic error for dense model

Lecture 6: Support Vector Regression
• Kernel Ridge Regression
• Support Vector Regression Primal
• Support Vector Regression Dual
• Summary of Kernel Models

2. Combining Predictive Features: Aggregation Models
3. Distilling Implicit Features: Extraction Models

Support Vector Regression: Kernel Ridge Regression

Recall: Representer Theorem

for any L2-regularized linear model

$$\min_{\mathbf{w}} \; \frac{\lambda}{N}\mathbf{w}^T\mathbf{w} + \frac{1}{N}\sum_{n=1}^{N} \mathrm{err}(y_n, \mathbf{w}^T\mathbf{z}_n)$$

optimal $\mathbf{w}_* = \sum_{n=1}^{N} \beta_n \mathbf{z}_n$,
so any L2-regularized linear model can be kernelized!

regression with squared error

$$\mathrm{err}(y, \mathbf{w}^T\mathbf{z}) = (y - \mathbf{w}^T\mathbf{z})^2$$

which has an analytic solution for linear/ridge regression

analytic solution for kernel ridge regression?

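Kernel Ridge Regression Problem

solving ridge regression

$$\min_{\mathbf{w}} \; \frac{\lambda}{N}\mathbf{w}^T\mathbf{w} + \frac{1}{N}\sum_{n=1}^{N} (y_n - \mathbf{w}^T\mathbf{z}_n)^2$$

yields optimal solution $\mathbf{w}_* = \sum_{n=1}^{N} \beta_n \mathbf{z}_n$

without loss of generality, can solve for optimal $\boldsymbol\beta$ instead of $\mathbf{w}$

$$\min_{\boldsymbol\beta} \; \frac{\lambda}{N}\sum_{n=1}^{N}\sum_{m=1}^{N} \beta_n \beta_m K(\mathbf{x}_n, \mathbf{x}_m) + \frac{1}{N}\sum_{n=1}^{N}\Bigl(y_n - \sum_{m=1}^{N} \beta_m K(\mathbf{x}_n, \mathbf{x}_m)\Bigr)^2$$

• first term: regularization of $\boldsymbol\beta$ on K-based regularizer
• second term: linear regression of $\boldsymbol\beta$ on K-based features

in matrix form:

$$\frac{\lambda}{N}\boldsymbol\beta^T K \boldsymbol\beta + \frac{1}{N}\bigl(\boldsymbol\beta^T K^T K \boldsymbol\beta - 2\boldsymbol\beta^T K^T \mathbf{y} + \mathbf{y}^T\mathbf{y}\bigr)$$

kernel ridge regression:
use representer theorem for kernel trick on ridge regression
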
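Solving Kernel Ridge Regression

$$E_{\text{aug}}(\boldsymbol\beta) = \frac{\lambda}{N}\boldsymbol\beta^T K \boldsymbol\beta + \frac{1}{N}\bigl(\boldsymbol\beta^T K^T K \boldsymbol\beta - 2\boldsymbol\beta^T K^T \mathbf{y} + \mathbf{y}^T\mathbf{y}\bigr)$$

$$\nabla E_{\text{aug}}(\boldsymbol\beta) = \frac{2}{N}\bigl(\lambda K^T I \boldsymbol\beta + K^T K \boldsymbol\beta - K^T \mathbf{y}\bigr) = \frac{2}{N} K^T \bigl((\lambda I + K)\boldsymbol\beta - \mathbf{y}\bigr)$$

want $\nabla E_{\text{aug}}(\boldsymbol\beta) = 0$: one analytic solution

$$\boldsymbol\beta = (\lambda I + K)^{-1}\mathbf{y}$$

• $(\cdot)^{-1}$ always exists for $\lambda > 0$, because $K$ is positive semi-definite (Mercer's condition, remember? :-))
• time complexity: $O(N^3)$ with simple dense matrix inversion

can now do non-linear regression 'easily'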
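
The analytic solution above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the Gaussian kernel, the toy sine data, the value of `lam`, and the helper names (`gaussian_kernel`, `solve_linear_system`, `krr_fit`, `krr_predict`) are all assumptions made for the example, and a hand-written Gaussian elimination stands in for a library solver.

```python
import math

def gaussian_kernel(x1, x2, gamma=1.0):
    """K(x, x') = exp(-gamma * ||x - x'||^2) for two equal-length sequences."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x1, x2)))

def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [A | b], copied so the inputs are not modified.
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def krr_fit(X, y, lam, kernel):
    """Return beta solving (lam * I + K) beta = y."""
    n = len(X)
    K = [[kernel(X[i], X[j]) for j in range(n)] for i in range(n)]
    A = [[K[i][j] + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]
    return solve_linear_system(A, y)

def krr_predict(X, beta, kernel, x):
    """g(x) = sum_n beta_n K(x_n, x)."""
    return sum(b * kernel(xn, x) for xn, b in zip(X, beta))

# Toy data (made up for illustration): noiseless samples of sin on [0, 3].
X = [[0.0], [0.5], [1.0], [1.5], [2.0], [2.5], [3.0]]
y = [math.sin(x[0]) for x in X]
beta = krr_fit(X, y, lam=1e-3, kernel=gaussian_kernel)
print(krr_predict(X, beta, gaussian_kernel, [1.25]))
```

Note that every training point contributes a term to `krr_predict`, which is the O(N) prediction cost and the dense β discussed in this lecture.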
Linear versus Kernel Ridge Regression

linear ridge regression

$$\mathbf{w} = (\lambda I + X^T X)^{-1} X^T \mathbf{y}$$

• more restricted
• $O(d^3 + d^2 N)$ training; $O(d)$ prediction
• efficient when $N \gg d$

kernel ridge regression

$$\boldsymbol\beta = (\lambda I + K)^{-1}\mathbf{y}$$

• more flexible with $K$
• $O(N^3)$ training; $O(N)$ prediction
• hard for big data

linear versus kernel:
trade-off between efficiency and flexibility
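
As a sanity check on the comparison, the two models coincide when the kernel is linear. The short derivation below is a sketch under the assumption that $X$ denotes the $N \times d$ data matrix with rows $\mathbf{x}_n^T$, so that $K = XX^T$ and the representer theorem gives $\mathbf{w} = X^T\boldsymbol\beta$; the subscripts on the identity matrices are added only to mark their sizes.

```latex
\begin{align*}
(\lambda I_d + X^T X)\, X^T &= X^T (\lambda I_N + X X^T) \\
\Rightarrow\quad X^T (\lambda I_N + X X^T)^{-1} &= (\lambda I_d + X^T X)^{-1} X^T \\
\Rightarrow\quad \mathbf{w} = X^T \boldsymbol\beta
  = X^T (\lambda I_N + K)^{-1} \mathbf{y}
  &= (\lambda I_d + X^T X)^{-1} X^T \mathbf{y}
\end{align*}
```

The same predictor is obtained either way; only the size of the matrix being inverted changes ($d \times d$ versus $N \times N$), which is the efficiency side of the trade-off.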
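Fun Time

After getting the optimal $\boldsymbol\beta$ from kernel ridge regression based on some kernel function $K$, what is the resulting $g(\mathbf{x})$?

(1) $\sum_{n=1}^{N} \beta_n K(\mathbf{x}_n, \mathbf{x})$
(2) $\sum_{n=1}^{N} y_n \beta_n K(\mathbf{x}_n, \mathbf{x})$
(3) $\sum_{n=1}^{N} \beta_n K(\mathbf{x}_n, \mathbf{x}) + \lambda$
(4) $\sum_{n=1}^{N} y_n \beta_n K(\mathbf{x}_n, \mathbf{x}) + \lambda$

Reference Answer: (1)
Recall that the optimal $\mathbf{w} = \sum_{n=1}^{N} \beta_n \mathbf{z}_n$ by the representer theorem and $g(\mathbf{x}) = \mathbf{w}^T\mathbf{z}$. The answer comes from combining the two equations with the kernel trick.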
Support Vector Regression: Support Vector Regression Primal

Soft-Margin SVM versus Least-Squares SVM

least-squares SVM (LSSVM) = kernel ridge regression for classification

[Figure panels: soft-margin Gaussian SVM; Gaussian LSSVM]

• LSSVM: similar boundary, many more SVs
  ⇒ slower prediction, dense $\boldsymbol\beta$ (BIG $g$)
• dense $\boldsymbol\beta$: LSSVM, kernel LogReg; sparse $\boldsymbol\alpha$: standard SVM

want: sparse $\boldsymbol\beta$ like standard SVM
Tube Regression

will consider tube regression
• within a tube: no error
• outside a tube: error by distance to tube

error measure:

$$\mathrm{err}(y, s) = \max(0, |s - y| - \epsilon)$$

• $|s - y| \le \epsilon$: $0$
• $|s - y| > \epsilon$: $|s - y| - \epsilon$

usually called $\epsilon$-insensitive error with $\epsilon > 0$

todo: L2-regularized tube regression to get sparse $\boldsymbol\beta$

Tube versus Squared Regression

• tube: $\mathrm{err}(y, s) = \max(0, |s - y| - \epsilon)$
• squared: $\mathrm{err}(y, s) = (s - y)^2$

[Figure: error as a function of $s$ for the squared and tube error measures]

tube ≈ squared when $|s - y|$ small
& less affected by outliers
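
The two error measures can be compared numerically. The snippet below is a small sketch; the tube half-width `eps = 0.5`, the target `y = 0`, and the sample predictions are hypothetical values picked only to show how the two errors grow.

```python
def tube_error(y, s, eps):
    """epsilon-insensitive error: max(0, |s - y| - eps)."""
    return max(0.0, abs(s - y) - eps)

def squared_error(y, s):
    """Squared error: (s - y)^2."""
    return (s - y) ** 2

eps = 0.5  # hypothetical tube half-width, chosen only for illustration
for s in [0.0, 0.3, 1.0, 2.0, 10.0]:
    # Columns: prediction s, tube error, squared error (target y = 0).
    print(s, tube_error(0.0, s, eps), squared_error(0.0, s))
```

For the far-off prediction the tube error grows linearly while the squared error grows quadratically, which is why a single outlier pulls a tube-regression fit less.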
L2-Regularized Tube Regression

$$\min_{\mathbf{w}} \; \frac{\lambda}{N}\mathbf{w}^T\mathbf{w} + \frac{1}{N}\sum_{n=1}^{N} \max\bigl(0, |\mathbf{w}^T\mathbf{z}_n - y_n| - \epsilon\bigr)$$

Regularized Tube Regr.

$$\min \; \frac{\lambda}{N}\mathbf{w}^T\mathbf{w} + \frac{1}{N}\sum \text{tube violation}$$

• unconstrained, but max not differentiable
• 'representer' to kernelize, but no obvious sparsity

standard SVM

$$\min \; \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum \text{margin violation}$$

• not differentiable, but QP
• dual to kernelize, KKT conditions ⇒ sparsity

will mimic standard SVM derivation:

$$\min_{b, \mathbf{w}} \; \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{n=1}^{N} \max\bigl(0, |\mathbf{w}^T\mathbf{z}_n + b - y_n| - \epsilon\bigr)$$
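Standard Support Vector Regression Primal

$$\min_{b, \mathbf{w}} \; \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{n=1}^{N} \max\bigl(0, |\mathbf{w}^T\mathbf{z}_n + b - y_n| - \epsilon\bigr)$$

mimicking standard SVM

$$\min_{b, \mathbf{w}, \boldsymbol\xi} \; \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{n=1}^{N} \xi_n \quad \text{s.t.} \quad |\mathbf{w}^T\mathbf{z}_n + b - y_n| \le \epsilon + \xi_n, \;\; \xi_n \ge 0$$

making constraints linear

$$\min_{b, \mathbf{w}, \boldsymbol\xi^{\vee}, \boldsymbol\xi^{\wedge}} \; \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{n=1}^{N} \bigl(\xi_n^{\vee} + \xi_n^{\wedge}\bigr)$$
$$\text{s.t.} \quad -\epsilon - \xi_n^{\vee} \le y_n - \mathbf{w}^T\mathbf{z}_n - b \le \epsilon + \xi_n^{\wedge}, \quad \xi_n^{\vee} \ge 0, \;\; \xi_n^{\wedge} \ge 0$$

Support Vector Regression (SVR) primal:
minimize regularizer + (upper tube violations $\xi_n^{\wedge}$ & lower violations $\xi_n^{\vee}$)
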
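Quadratic Programming for SVR

$$\min_{b, \mathbf{w}, \boldsymbol\xi^{\vee}, \boldsymbol\xi^{\wedge}} \; \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{n=1}^{N} \bigl(\xi_n^{\vee} + \xi_n^{\wedge}\bigr)$$
$$\text{s.t.} \quad -\epsilon - \xi_n^{\vee} \le y_n - \mathbf{w}^T\mathbf{z}_n - b \le \epsilon + \xi_n^{\wedge}, \quad \xi_n^{\vee} \ge 0, \;\; \xi_n^{\wedge} \ge 0$$

• parameter $C$: trade-off of regularization & tube violation
• parameter $\epsilon$: vertical tube width; one more parameter to choose!
• QP of $d + 1 + 2N$ variables, $2N + 2N$ constraints

next: remove dependence on $d$ by SVR primal ⇒ dual?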
Fun Time

Consider solving support vector regression with $\epsilon = 0.05$. At the optimal solution, assume that $\mathbf{w}^T\mathbf{z}_1 + b = 1.234$ and $y_1 = 1.126$. What are $\xi_1^{\vee}$ and $\xi_1^{\wedge}$?

(1) $\xi_1^{\vee} = 0.108$, $\xi_1^{\wedge} = 0.000$
(2) $\xi_1^{\vee} = 0.000$, $\xi_1^{\wedge} = 0.108$
(3) $\xi_1^{\vee} = 0.058$, $\xi_1^{\wedge} = 0.000$
(4) $\xi_1^{\vee} = 0.000$, $\xi_1^{\wedge} = 0.058$

Reference Answer: (3)
$y_1 - \mathbf{w}^T\mathbf{z}_1 - b = -0.108 < -0.05$, which means that there is a lower tube violation of amount $0.058$. When there is a lower tube violation on some example, trivially there is no upper tube violation.
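
The arithmetic in this answer can be checked directly. The helper below is an illustrative sketch (the name `tube_slacks` and its argument names are not from the lecture); it returns the smallest slacks that satisfy the two linear SVR constraints for one example, which is what the optimal solution uses since $C > 0$.

```python
def tube_slacks(y, s, eps):
    """Return (xi_lower, xi_upper) for target y and prediction s = w^T z + b."""
    r = y - s
    xi_upper = max(0.0, r - eps)   # y above the tube: y - s > eps
    xi_lower = max(0.0, -r - eps)  # y below the tube: y - s < -eps
    return xi_lower, xi_upper

xi_lower, xi_upper = tube_slacks(y=1.126, s=1.234, eps=0.05)
print(round(xi_lower, 3), round(xi_upper, 3))  # prints: 0.058 0.0
```

At most one of the two slacks can be positive for a given example, because a point cannot lie above and below the tube at the same time.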
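Support Vector Regression: Support Vector Regression Dual

Lagrange Multipliers $\alpha^{\wedge}$ & $\alpha^{\vee}$

objective function $\frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{n=1}^{N}\bigl(\xi_n^{\vee} + \xi_n^{\wedge}\bigr)$
Lagrange multiplier $\alpha_n^{\wedge}$ for $y_n - \mathbf{w}^T\mathbf{z}_n - b \le \epsilon + \xi_n^{\wedge}$
Lagrange multiplier $\alpha_n^{\vee}$ for $-\epsilon - \xi_n^{\vee} \le y_n - \mathbf{w}^T\mathbf{z}_n - b$

Some of the KKT Conditions

• $\frac{\partial \mathcal{L}}{\partial w_i} = 0$: $\mathbf{w} = \sum_{n=1}^{N} \underbrace{(\alpha_n^{\wedge} - \alpha_n^{\vee})}_{\beta_n} \mathbf{z}_n$; $\frac{\partial \mathcal{L}}{\partial b} = 0$: $\sum_{n=1}^{N} (\alpha_n^{\wedge} - \alpha_n^{\vee}) = 0$
• complementary slackness:

$$\alpha_n^{\wedge}\bigl(\epsilon + \xi_n^{\wedge} - y_n + \mathbf{w}^T\mathbf{z}_n + b\bigr) = 0$$
$$\alpha_n^{\vee}\bigl(\epsilon + \xi_n^{\vee} + y_n - \mathbf{w}^T\mathbf{z}_n - b\bigr) = 0$$

standard dual can be derived using the same steps as Lecture 4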


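SVM Dual and SVR Dual

primal, soft-margin SVM:

$$\min \; \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{n=1}^{N} \xi_n$$
$$\text{s.t.} \quad y_n(\mathbf{w}^T\mathbf{z}_n + b) \ge 1 - \xi_n, \quad \xi_n \ge 0$$

primal, SVR:

$$\min \; \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{n=1}^{N} \bigl(\xi_n^{\vee} + \xi_n^{\wedge}\bigr)$$
$$\text{s.t.} \quad 1\,(y_n - \mathbf{w}^T\mathbf{z}_n - b) \le \epsilon + \xi_n^{\wedge}, \quad 1\,(\mathbf{w}^T\mathbf{z}_n + b - y_n) \le \epsilon + \xi_n^{\vee}, \quad \xi_n^{\wedge} \ge 0, \;\; \xi_n^{\vee} \ge 0$$

dual, soft-margin SVM:

$$\min \; \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} \alpha_n\alpha_m y_n y_m K(\mathbf{x}_n, \mathbf{x}_m) - \sum_{n=1}^{N} 1\cdot\alpha_n$$
$$\text{s.t.} \quad \sum_{n=1}^{N} y_n\alpha_n = 0, \quad 0 \le \alpha_n \le C$$

dual, SVR:

$$\min \; \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} (\alpha_n^{\wedge} - \alpha_n^{\vee})(\alpha_m^{\wedge} - \alpha_m^{\vee}) K(\mathbf{x}_n, \mathbf{x}_m) + \sum_{n=1}^{N} \bigl((\epsilon - y_n)\cdot\alpha_n^{\wedge} + (\epsilon + y_n)\cdot\alpha_n^{\vee}\bigr)$$
$$\text{s.t.} \quad \sum_{n=1}^{N} 1\cdot(\alpha_n^{\wedge} - \alpha_n^{\vee}) = 0, \quad 0 \le \alpha_n^{\wedge} \le C, \;\; 0 \le \alpha_n^{\vee} \le C$$

similar QP, solvable by similar solver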
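
To hand the SVR dual to a generic QP solver, it helps to stack the two multiplier vectors into one variable. The sketch below only builds the QP coefficients and does not call any solver; the function name `svr_dual_qp`, the stacking order `u = [alpha_upper; alpha_lower]`, and the toy kernel matrix are assumptions made for illustration, and a real solver may expect a different input layout.

```python
def svr_dual_qp(K, y, eps, C):
    """Build the SVR dual as a QP in u = [alpha_upper; alpha_lower] (length 2N):

        min_u  0.5 * u^T Q u + p^T u
        s.t.   a^T u = 0,  0 <= u_i <= C
    """
    n = len(y)
    # Q = [[K, -K], [-K, K]] encodes (alpha_upper - alpha_lower)^T K (alpha_upper - alpha_lower).
    Q = [[0.0] * (2 * n) for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            Q[i][j] = K[i][j]
            Q[i][n + j] = -K[i][j]
            Q[n + i][j] = -K[i][j]
            Q[n + i][n + j] = K[i][j]
    # Linear term: (eps - y_n) for alpha_upper_n, (eps + y_n) for alpha_lower_n.
    p = [eps - yn for yn in y] + [eps + yn for yn in y]
    # Equality constraint: sum_n (alpha_upper_n - alpha_lower_n) = 0.
    a = [1.0] * n + [-1.0] * n
    bounds = [(0.0, C)] * (2 * n)
    return Q, p, a, bounds

# Hypothetical 2-point kernel matrix and targets, only to show the shapes.
K = [[1.0, 0.5], [0.5, 1.0]]
y = [1.0, -1.0]
Q, p, a, bounds = svr_dual_qp(K, y, eps=0.1, C=10.0)
print(len(Q), len(Q[0]), p)
```

The QP has 2N variables regardless of the dimension of $\mathbf{z}$, since the data enter only through the kernel matrix.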





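Sparsity of SVR Solution

complementary slackness:

$$\alpha_n^{\wedge}\bigl(\epsilon + \xi_n^{\wedge} - y_n + \mathbf{w}^T\mathbf{z}_n + b\bigr) = 0$$
$$\alpha_n^{\vee}\bigl(\epsilon + \xi_n^{\vee} + y_n - \mathbf{w}^T\mathbf{z}_n - b\bigr) = 0$$

• strictly within tube: $|\mathbf{w}^T\mathbf{z}_n + b - y_n| < \epsilon$
  ⇒ $\xi_n^{\wedge} = 0$ and $\xi_n^{\vee} = 0$
  ⇒ $(\epsilon + \xi_n^{\wedge} - y_n + \mathbf{w}^T\mathbf{z}_n + b) \ne 0$ and $(\epsilon + \xi_n^{\vee} + y_n - \mathbf{w}^T\mathbf{z}_n - b) \ne 0$
  ⇒ $\alpha_n^{\wedge} = 0$ and $\alpha_n^{\vee} = 0$
  ⇒ $\beta_n = 0$
• SVs ($\beta_n \ne 0$): on or outside tube

SVR: allows sparse $\boldsymbol\beta$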


Fun Time

What is the number of variables within the QP problem of SVR dual?

(1) $d + 1$
(2) $d + 1 + 2N$
(3) $N$
(4) $2N$

Reference Answer: (4)
There are $N$ variables within $\boldsymbol\alpha^{\vee}$, and another $N$ in $\boldsymbol\alpha^{\wedge}$.
Support Vector Regression: Summary of Kernel Models

Map of Linear Models

first row:
• PLA/pocket: minimize $\mathrm{err}_{0/1}$ specially
• linear SVR: minimize regularized $\mathrm{err}_{\text{TUBE}}$ by QP

second row:
• linear soft-margin SVM: minimize regularized $\widehat{\mathrm{err}}_{\text{SVM}}$ by QP
• linear ridge regression: minimize regularized $\mathrm{err}_{\text{SQR}}$ analytically
• regularized logistic regression: minimize regularized $\mathrm{err}_{\text{CE}}$ by GD/SGD

second row: popular in LIBLINEAR
Map of Linear/Kernel Models

first row:
• PLA/pocket
• linear SVR

second row:
• linear soft-margin SVM
• linear ridge regression
• regularized logistic regression

third row:
• kernel ridge regression: kernelized linear ridge regression
• kernel logistic regression: kernelized regularized logistic regression

fourth row:
• SVM: minimize SVM dual by QP
• SVR: minimize SVR dual by QP
• probabilistic SVM: run SVM-transformed logistic regression

fourth row: popular in
first row: less used due to
third row: less used due to
Kernel Models

possible kernels: polynomial, Gaussian, ..., your design (with Mercer's condition),

coupled with
• kernel ridge regression
• kernel logistic regression
• SVM
• SVR
• probabilistic SVM

powerful extension of linear models;
with great power comes great responsibility in Spiderman, remember? :-)

Fun Time

Which of the following models is less used in practice?

(1) pocket
(2) ridge regression
(3) (linear or kernel) soft-margin SVM
(4) regularized logistic regression

Reference Answer: (1)
The pocket algorithm generally does not perform better than linear soft-margin SVM, and hence is less used in practice.
Summary

1. Embedding Numerous Features: Kernel Models

Lecture 6: Support Vector Regression
• Kernel Ridge Regression: representer theorem on ridge regression
• Support Vector Regression Primal: minimize regularized tube errors
• Support Vector Regression Dual: a QP similar to SVM dual
• Summary of Kernel Models: with great power comes great responsibility

2. Combining Predictive Features: Aggregation Models
• next: making cocktail from learning models

3. Distilling Implicit Features: Extraction Models